Spreadsheet Tools for Data Analysts
Spreadsheets are a natural fit for data analysis, combining a simple data storage and presentation layer with a programming language and basic debugging tools. Because spreadsheets are accessible and flexible, they are used by both novices and experts. Consequently, spreadsheets are hugely popular, with more than 750 million copies of Microsoft Excel installed worldwide. By this measure, the spreadsheet is the most widely used programming language on the planet and the de facto tool for data analysis.
Nevertheless, spreadsheets do not address a number of important tasks in a typical analyst's pipeline, and their design frequently complicates those tasks. This thesis describes three key challenges for analysts using spreadsheets. 1) Data wrangling is the process of converting or mapping data from a raw form into another form suitable for use with automated tools. 2) Data cleaning is the process of locating and correcting omitted or erroneous data. 3) Formula auditing is the process of finding and correcting spreadsheet program errors. Together, these three tasks are estimated to occupy more than three quarters of a data analyst's time. Furthermore, errors not caught during these steps have led to catastrophically bad decisions resulting in billions of dollars in losses. Advances in automated techniques for these tasks may result in dramatic savings in both time and money.
Three novel programming language-based techniques were created to address these key tasks. The first, automatic layout transformation using examples, is a program synthesis-based technique that lets spreadsheet users perform data wrangling tasks automatically, at scale, and without programming. The second, data debugging, is a technique for data cleaning that combines program analysis and statistical analysis to automatically find likely data errors. The third, spatio-structural program analysis, unifies positional and dependence information and finds spreadsheet errors using a form of anomaly analysis.
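
To make the data-debugging idea concrete, the following is a minimal Python sketch of the underlying intuition: score each input by how much the final result changes when that input is removed, then flag inputs whose impact is a statistical outlier. The function names, the leave-one-out scoring, and the z-score cutoff are illustrative assumptions for this sketch, not CheckCell's actual interface or algorithm.

import statistics

def impact_scores(inputs, compute):
    """Score each input by how much the output changes when it is removed."""
    baseline = compute(inputs)
    scores = []
    for i in range(len(inputs)):
        rest = inputs[:i] + inputs[i + 1:]
        scores.append(abs(compute(rest) - baseline))
    return scores

def likely_errors(inputs, compute, z_cutoff=2.0):
    """Flag inputs whose impact on the output is statistically anomalous."""
    scores = impact_scores(inputs, compute)
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores) or 1.0
    return [(i, inputs[i]) for i, s in enumerate(scores)
            if (s - mu) / sigma > z_cutoff]

# Example: a typo (1200.0 instead of roughly 12.0) dominates the mean
# and is flagged as a likely data error.
data = [10.5, 11.2, 9.8, 1200.0, 10.1, 11.0, 9.9, 10.4]
print(likely_errors(data, statistics.mean))  # [(3, 1200.0)]

CheckCell's actual analysis tracks how values flow through spreadsheet formula dependencies rather than through a single aggregate function, but the impact-outlier intuition is the same.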
Each technique was implemented as an end-user tool---FlashRelate, CheckCell, and ExceLint, respectively---in the form of a point-and-click plugin for Microsoft Excel. Our evaluation demonstrates that these techniques substantially improve user efficiency. Finally, because these tools build on each other in a complementary fashion, data analysts can run data wrangling, cleaning, and formula auditing tasks together in a single analysis pipeline.
FlashRelate: Extracting relational data from semi-structured spreadsheets using examples.
Spreadsheets store a tremendous amount of important data. One reason spreadsheets are so successful is that they are both easy to use and allow users great expressiveness in storing and manipulating their data. This flexibility comes at a price, as presentation elements are often combined with the underlying data model. As a result, many spreadsheets contain data in ad-hoc formats. These formats complicate the use of traditional relational tools, which require data in a normalized form. Normalizing data from these formats is often tedious or requires programming, and often a user may prefer the original presentation. We describe an approach that allows users to easily extract structured data from spreadsheets without programming. We make two contributions. First, we describe a novel domain-specific language called Flare that extends traditional regular expressions with spatial constraints. Second, we describe an algorithm called FlashRelate that can synthesize Flare programs from user-provided positive and negative examples. Using 43 benchmarks drawn both from a standard spreadsheet corpus and from Excel user-help forums, we demonstrate that correct extraction programs can be synthesized quickly from a small number of examples. Our approach generalizes to many data-cleaning tasks on semi-structured spreadsheets.
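
As an illustration of the kind of spatial constraint Flare adds to regular expressions, the Python sketch below matches pairs of cells where one cell satisfies one regex and the cell at a fixed offset satisfies another. The grid representation, the function name, and this toy one-constraint pattern are assumptions made for illustration; the actual Flare DSL and the FlashRelate synthesis algorithm are defined in the paper.

import re

def match_pairs(grid, anchor_re, value_re, offset=(0, 1)):
    """Yield (anchor, value) cell pairs where both regexes match and the
    value cell sits at the given (row, col) offset from the anchor cell."""
    dr, dc = offset
    for (r, c), text in grid.items():
        neighbor = grid.get((r + dr, c + dc))
        if neighbor is None:
            continue
        if re.fullmatch(anchor_re, text) and re.fullmatch(value_re, neighbor):
            yield text, neighbor

# Example: extract (year, total) pairs from an ad-hoc layout where each
# year label sits immediately left of its total.
sheet = {(0, 0): "Sales", (1, 0): "2014", (1, 1): "487",
         (2, 0): "2015", (2, 1): "512", (3, 0): "Notes", (3, 1): "draft"}
print(list(match_pairs(sheet, r"\d{4}", r"\d+")))
# [('2014', '487'), ('2015', '512')]

FlashRelate's contribution is that users never write such patterns by hand: the system synthesizes them from a few positive and negative example cells.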
An Abstract Interpretation Framework for Input Data Usage
Data science software plays an increasingly important role in critical decision making in fields ranging from economy and finance to biology and medicine. As a result, errors in data science applications can have severe consequences, especially when they lead to results that look plausible, but are incorrect. A common cause of such errors is when applications erroneously ignore some of their input data, for instance due to bugs in the code that reads, filters, or clusters it.
In this paper, we propose an abstract interpretation framework to automatically detect unused input data. We derive a program semantics that precisely captures data usage by abstraction of the program’s operational trace semantics and express it in a constructive fixpoint form. Based on this semantics, we systematically derive static analyses that automatically detect unused input data by fixpoint approximation.
This clear design principle provides a framework that subsumes existing analyses; we show that secure information flow analyses and a form of live variables analysis can be used for data usage, with varying degrees of precision. Additionally, we derive a static analysis to detect single unused data inputs, which is similar to dependency analyses used in the context of backward program slicing. Finally, we demonstrate the value of expressing such analyses as abstract interpretation by combining them with an existing abstraction of compound data structures such as arrays and lists to detect unused chunks of the data.
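
To illustrate the flavor of the derived analyses, the sketch below runs a backward, liveness-style dependency pass over a toy straight-line program and reports input variables that cannot influence any output. The program representation and function name are assumptions made for this sketch; the paper's framework is formulated as an abstract interpretation of the program's operational trace semantics, not over this toy language.

def unused_inputs(stmts, inputs, outputs):
    """Return input variables that cannot influence any output variable.
    Each statement is a (target, set-of-variables-read) assignment."""
    live = set(outputs)
    # One backward pass suffices for straight-line code; a language with
    # loops would repeat this transfer function until a fixpoint is reached.
    for target, reads in reversed(stmts):
        if target in live:
            live = (live - {target}) | set(reads)
    return set(inputs) - live

# Example: a bug drops `extra` from the total, so `bonus` never flows to
# the output and is reported as unused input data.
program = [("base",  {"salary"}),
           ("extra", {"bonus"}),   # computed but never used below
           ("total", {"base"})]    # bug: should read {"base", "extra"}
print(unused_inputs(program, inputs={"salary", "bonus"}, outputs={"total"}))
# {'bonus'}

This kind of result is exactly the plausible-but-wrong failure mode the paper targets: the program runs without error, yet part of its input silently never reaches the output.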